Source: https://archive.ics.uci.edu/ml/datasets/Urban+Land+Cover

Data Set Information:

Contains training and testing data for classifying a high resolution aerial image into 9 types of urban land cover. Multi-scale spectral, size, shape, and texture information are used for classification.

Class is the target classification variable. The land cover classes are: trees, grass, soil, concrete, asphalt, buildings, cars, pools, shadows.

Attribute Information:

LEGEND
- Class: Land cover class (nominal)
- BrdIndx: Border Index (shape variable)
- Area: Area in m² (size variable)
- Round: Roundness (shape variable)
- Bright: Brightness (spectral variable)
- Compact: Compactness (shape variable)
- ShpIndx: Shape Index (shape variable)
- Mean_G: Green (spectral variable)
- Mean_R: Red (spectral variable)
- Mean_NIR: Near Infrared (spectral variable)
- SD_G: Standard deviation of Green (texture variable)
- SD_R: Standard deviation of Red (texture variable)
- SD_NIR: Standard deviation of Near Infrared (texture variable)
- LW: Length/Width (shape variable)
- GLCM1: Gray-Level Co-occurrence Matrix attribute (texture variable)
- Rect: Rectangularity (shape variable)
- GLCM2: Another Gray-Level Co-occurrence Matrix attribute (texture variable)
- Dens: Density (shape variable)
- Assym: Asymmetry (shape variable)
- NDVI: Normalized Difference Vegetation Index (spectral variable)
- BordLngth: Border Length (shape variable)
- GLCM3: Another Gray-Level Co-occurrence Matrix attribute (texture variable)

Note: These variables repeat for each coarser scale (e.g. variable_40, variable_60, ..., variable_140).

Unusually, the training set is smaller than the test set. I will combine the two files and split them myself.
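Combining and re-splitting can be sketched as below. The two small DataFrames are stand-ins for the actual UCI training/testing CSVs (the real data has 147 features and 9 classes); the stratified 70/30 split keeps the class frequencies comparable between the new train and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frames; in the notebook these come from the UCI training/testing files.
train = pd.DataFrame({"NDVI": range(60), "class": ["trees", "grass", "soil"] * 20})
test = pd.DataFrame({"NDVI": range(120), "class": ["trees", "grass", "soil"] * 40})

# Combine both files, then re-split so the training set is the larger one.
full = pd.concat([train, test], ignore_index=True)
X = full.drop(columns="class")
y = full["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
```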

There is a class imbalance problem: the ratio between the largest and smallest classes exceeds 3:1.

Lasso

There seems to be some overfitting: the training scores are considerably higher than the test scores.
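For a multiclass problem, a "Lasso" classifier is typically realized as L1-penalized logistic regression, which drives many coefficients to exactly zero. A minimal sketch of that setup, using synthetic stand-in data in place of the land-cover features (the original has 147 features and 9 classes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the multi-scale land-cover features.
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# L1 penalty gives lasso-style sparsity; scaling first helps the solver.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000),
)
model.fit(X_tr, y_tr)
print("train:", model.score(X_tr, y_tr))
print("test: ", model.score(X_te, y_te))
```

Comparing the two printed scores is the overfitting check: a large train-test gap means the model memorizes rather than generalizes.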

Decision Tree

The best value for the minimum number of observations per leaf is 4, and the best complexity parameter is 0.0. The accuracy score is better; however, there is still some overfitting.
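Those two hyperparameters correspond (assuming scikit-learn) to `min_samples_leaf` and the cost-complexity pruning strength `ccp_alpha`; a grid search over them could look like this, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        "min_samples_leaf": [1, 2, 4, 8, 16],  # minimum observations per leaf
        "ccp_alpha": [0.0, 0.001, 0.01],       # cost-complexity pruning strength
    },
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
print("train:", grid.score(X_tr, y_tr), "test:", grid.score(X_te, y_te))
```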

Random Forest

Although there is still some overfitting, these are the best results so far!

I believe the result could be further improved by searching over more values of n_estimators.
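Extending the search to `n_estimators` is straightforward with a grid; more trees rarely hurt random-forest accuracy but do cost training time. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Probe a few forest sizes alongside tree depth; the grid values are
# illustrative, not the notebook's actual search space.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [None, 10]},
    cv=3,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```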

Stochastic Gradient Boosting

There is some overfitting; however, the results improve even further.
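The "stochastic" part of stochastic gradient boosting comes from setting `subsample` below 1.0 in scikit-learn's `GradientBoostingClassifier`: each tree is then fit on a random fraction of the training rows, which acts as a regularizer. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# subsample < 1.0 fits each tree on a random 80% of the rows.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                subsample=0.8, random_state=0)
gb.fit(X_tr, y_tr)
print("train:", gb.score(X_tr, y_tr))
print("test: ", gb.score(X_te, y_te))
```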

The best results are obtained with the gradient boosting model.

Feature selection could improve these results.
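One simple way to try that is model-based selection: keep only the features whose importance, under the fitted booster, exceeds some threshold. A sketch with `SelectFromModel` on synthetic stand-in data (the median threshold is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           n_classes=3, random_state=0)

# Keep only features whose importance is at least the median importance.
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
X_sel = selector.fit_transform(X, y)
print(X.shape, "->", X_sel.shape)
```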

Let's check the confusion matrix of the best model.
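With scikit-learn that is a one-liner on the held-out predictions; rows are true classes and columns are predicted classes, so off-diagonal entries show which land-cover types get confused with each other. Again using synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, gb.predict(X_te))
print(cm)  # rows: true class, columns: predicted class
```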